Introduction

We’ve shown that clone-correction can reduce the power of \(\bar{r}_d\) to detect clonal reproduction, but it’s also clear that a reduced sample size contributes to this as well. To disentangle this, we assess the \(\bar{r}_d\) from 1000 replicates of n samples where n is the size of the data set after clone-correction. The data were processed via the analyze_jackknife_ia.R script. These were saved in the ma_jack_rda_files/ directory.

Note: this is the same analysis, utilizing more simulated alleles.

Setup

Packages

library('zksimanalysis')

Loading Data

res        <- load(here::here("data", "ma_full_results.rda")) # load original values of rd
jack_files <- list.files(here::here("data", "ma_jack_rda_files/"), full.names = TRUE) # load jackknife simulations
jacklist   <- setNames(jack_files, basename(jack_files))

for (i in jack_files){
  jacklist[basename(i)] <- load(i)
}
jackdat <- jacklist %>% 
  map_df(get, .id = "source") %>% 
  mutate(mutation_rate = ifelse(grepl("mutant", source), "even", "uneven")) %>%
  mutate(source = gsub("(^.+?feather).*$", "\\1.DATA.rda", source)) %>%
  select(matches("jack"), everything())
rm(list = jacklist)
gc()
##              used    (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    5858331   312.9    9968622   532.4    6208520   331.6
## Vcells 1359759004 10374.2 1832699763 13982.4 1361662860 10388.7
# taking stock of how much data we have
jackdat %>% filter(!is.na(pop))
## # A tibble: 55,300 x 7
##    jack.p.Ia jack.p.rD     jack.ia     jack.rd
##        <dbl>     <dbl>      <list>      <list>
##  1     0.032     0.032 <dbl [999]> <dbl [999]>
##  2     0.002     0.002 <dbl [999]> <dbl [999]>
##  3     0.001     0.001 <dbl [999]> <dbl [999]>
##  4     0.004     0.004 <dbl [999]> <dbl [999]>
##  5     0.146     0.140 <dbl [999]> <dbl [999]>
##  6     0.046     0.001 <dbl [999]> <dbl [999]>
##  7     0.014     0.010 <dbl [999]> <dbl [999]>
##  8     0.018     0.008 <dbl [999]> <dbl [999]>
##  9     0.114     0.077 <dbl [999]> <dbl [999]>
## 10     0.055     0.038 <dbl [999]> <dbl [999]>
## # ... with 55,290 more rows, and 3 more variables: source <chr>,
## #   pop <chr>, mutation_rate <chr>
get(res)
## # A tibble: 79,992 x 54
##           Ia  p.Ia     rbarD  p.rD  samples.ia  samples.rd      Iacc
##        <dbl> <dbl>     <dbl> <dbl>      <list>      <list>     <dbl>
##  1 15.461693 0.001 0.7877455 0.001 <dbl [999]> <dbl [999]> 14.106736
##  2 14.052082 0.001 0.7073192 0.001 <dbl [999]> <dbl [999]> 12.350456
##  3 14.068551 0.001 0.7077485 0.001 <dbl [999]> <dbl [999]> 13.246838
##  4 13.625753 0.001 0.6849662 0.001 <dbl [999]> <dbl [999]> 12.401617
##  5 17.691915 0.001 0.9512911 0.001 <dbl [999]> <dbl [999]> 17.320580
##  6 17.432074 0.001 0.9279263 0.001 <dbl [999]> <dbl [999]> 17.178575
##  7 17.419960 0.001 0.9369048 0.001 <dbl [999]> <dbl [999]> 16.995387
##  8 17.007769 0.001 0.9029985 0.001 <dbl [999]> <dbl [999]> 15.596913
##  9  2.385947 0.001 0.3453806 0.001 <dbl [999]> <dbl [999]>  2.086592
## 10  2.786404 0.001 0.3593397 0.001 <dbl [999]> <dbl [999]>  2.658896
## # ... with 79,982 more rows, and 47 more variables: p.Iacc <dbl>,
## #   rbarDcc <dbl>, p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>,
## #   pop <chr>, run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## #   sample <fctr>, mutation_rate <chr>, H <dbl>, G <dbl>, lambda <dbl>,
## #   E.5 <dbl>, B <dbl>, uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>,
## #   lambda.var <dbl>, E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>,
## #   NMLG.var <dbl>, allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>,
## #   tab <list>, allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>,
## #   Evennesscc <dbl>, tabcc <list>, H.locus <dbl>, G.locus <dbl>,
## #   lambda.locus <dbl>, E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>,
## #   curve <list>, pgen <list>, CF <dbl>, source <chr>

Note that above, I’m filtering jackdat for non-missing populations. I should explain this a bit. When jack.ia was put into poppr, the requirements for running that n was greater than 2 and smaller than the total popualtion size. If the number of MLGs did not fit this requirement, tidy_jackia would return all NA, including that for population.

We can’t ask any questions regarding bias with missing data, so we are filtering it out here.

# combining original values and jackknife simulations
vals <- get(res) %>%
  inner_join(filter(jackdat, !is.na(pop)), # filtering out missing data
             by = c("source", "mutation_rate", "pop")) %>%
  select(rbarD, jack.rd, jack.p.rD, mutation_rate, everything())
vals
## # A tibble: 55,300 x 58
##        rbarD     jack.rd jack.p.rD mutation_rate        Ia  p.Ia  p.rD
##        <dbl>      <list>     <dbl>         <chr>     <dbl> <dbl> <dbl>
##  1 0.7877455 <dbl [999]>     0.147        uneven 15.461693 0.001 0.001
##  2 0.7073192 <dbl [999]>     0.021        uneven 14.052082 0.001 0.001
##  3 0.7077485 <dbl [999]>     0.151        uneven 14.068551 0.001 0.001
##  4 0.6849662 <dbl [999]>     0.069        uneven 13.625753 0.001 0.001
##  5 0.9512911 <dbl [999]>     0.219        uneven 17.691915 0.001 0.001
##  6 0.9279263 <dbl [999]>     0.127        uneven 17.432074 0.001 0.001
##  7 0.9369048 <dbl [999]>     0.019        uneven 17.419960 0.001 0.001
##  8 0.9029985 <dbl [999]>     0.004        uneven 17.007769 0.001 0.001
##  9 0.3453806 <dbl [999]>     0.223        uneven  2.385947 0.001 0.001
## 10 0.3593397 <dbl [999]>     0.233        uneven  2.786404 0.001 0.001
## # ... with 55,290 more rows, and 51 more variables: samples.ia <list>,
## #   samples.rd <list>, Iacc <dbl>, p.Iacc <dbl>, rbarDcc <dbl>,
## #   p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>, pop <chr>,
## #   run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## #   sample <fctr>, H <dbl>, G <dbl>, lambda <dbl>, E.5 <dbl>, B <dbl>,
## #   uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>, lambda.var <dbl>,
## #   E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>, NMLG.var <dbl>,
## #   allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>, tab <list>,
## #   allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>, Evennesscc <dbl>,
## #   tabcc <list>, H.locus <dbl>, G.locus <dbl>, lambda.locus <dbl>,
## #   E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>, curve <list>,
## #   pgen <list>, CF <dbl>, source <chr>, jack.p.Ia <dbl>, jack.ia <list>

Data Exploring!

First, let’s see what we have after the filtering:

vals %>%
  group_by(mutation_rate, sexrate, sample) %>%
  summarize(n = n()) %>%
  spread(sample, n) %>%
  knitr::kable()
mutation_rate sexrate 10 25 50 100
even 0.0000 962 997 1000 1000
even 0.0001 975 995 1000 1000
even 0.0005 985 999 999 999
even 0.0010 954 1000 1000 1000
even 0.0050 579 996 1000 1000
even 0.0100 371 939 1000 1000
even 0.0500 79 473 925 1000
even 0.1000 50 262 737 995
even 0.5000 5 37 140 472
even 1.0000 NA 1 NA NA
uneven 0.0000 927 1000 1000 1000
uneven 0.0001 910 999 1000 1000
uneven 0.0005 878 999 999 999
uneven 0.0010 787 1000 1000 1000
uneven 0.0050 501 985 1000 1000
uneven 0.0100 328 916 1000 1000
uneven 0.0500 76 462 922 1000
uneven 0.1000 50 258 732 995
uneven 0.5000 5 37 140 469

Now, let’s visualize the “p-value” for each jack knife analysis. This p-value represents the fraction of jack-knife simulations less than or equal to the clone-corrected value.

In essence, we are asking the following question:

Can the clone-corrected value of \(\bar{r}_d\) be attributed to a small sample size?

vals %>%
  ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
  geom_histogram(alpha = 0.5, binwidth = 0.05) +
  facet_grid(sexrate~sample, scales = "free")

Let’s take a look at the values when the pvalue for \(\bar{r}_d\) is < 0.05:

vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sprintf("%.2f (%d)", sum(jack.p.rD < 0.05)/n(), n())) %>%
  spread(sample, reject)
## # A tibble: 10 x 5
## # Groups:   sexrate [10]
##    sexrate       `10`        `25`        `50`       `100`
##  *   <chr>      <chr>       <chr>       <chr>       <chr>
##  1  0.0000 0.11 (333)  0.57 (260)  0.62 (248)  0.66 (230)
##  2  0.0001 0.10 (272)  0.60 (229)  0.61 (192)  0.68 (183)
##  3  0.0005 0.30 (133)   1.00 (20)    1.00 (6)    1.00 (2)
##  4  0.0010 0.25 (269)   1.00 (43)   1.00 (11)    1.00 (7)
##  5  0.0050 0.03 (721)  0.93 (810)  0.99 (475)  1.00 (250)
##  6  0.0100 0.01 (605) 0.76 (1282) 0.96 (1096)  0.99 (768)
##  7  0.0500 0.00 (143)  0.26 (852) 0.57 (1648) 0.71 (1750)
##  8  0.1000  0.00 (93)  0.12 (483) 0.36 (1309) 0.47 (1752)
##  9  0.5000  0.00 (10)   0.03 (69)  0.21 (248)  0.17 (750)
## 10  1.0000       <NA>    0.00 (1)        <NA>        <NA>
vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
  geom_histogram(alpha = 0.5, binwidth = 0.05) +
  facet_grid(sexrate~sample, scales = "free_y")

Yeah, those don’t look too good. The graph above is basically saying that we would consider most of the clonal populations to be sexually derived because the clone-corrected value is lower than the distribution. Of course, this is expecting that the mean value for the distribution is near the observed value of the whole data set. This may not be true.

To correct for this, I’m going to calculate the bias a la Section 10.3 in Efron and Tibshirani by subtracting the observed value from the bootstrap estimate:

\[ \hat\theta_B = \frac{1}{B} \sum\theta^*\\ Bias[\hat\theta] = \hat\theta_B - \hat\theta \]

Once I have the bias, I’m going to compare the bias-corrected value of the clone-corrected \(\bar{r}_d\) to the null distribution of the full data set. This would allow us to see if the reduced sample size affected the data (good example from http://www.stat.umn.edu/geyer/5601/examp/bias.html).

Note to future Zhian:

The answer may not be as simple as accounting for bias due to sample size, but also bias due to duplicated samples. Perhaps the way to go is to subtract the sum of the biases from the data.

makep <- function(x, y) (sum(x <= y, na.rm = TRUE) + 1)/(length(y) + 1)
bvals <- vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
  mutate(bias = jack.mean - rbarD) %>% # calculating bias a la Effron
  select(bias, jack.mean, everything()) %>%
  mutate(jack.p.rDb = map2_dbl(rbarDcc - bias, samples.rd, makep)) %>%
  select(bias, jack.p.rD, jack.p.rDb, everything())

Now we can visualize this as a table of the percentage rejection.

bvals %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sprintf("%.2f (%.4f)", 
                             sum(jack.p.rDb < 0.05)/n(), 
                             median(bias, na.rm = TRUE))) %>%
  spread(sample, reject) %>%
  knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (median bias)")
Null Hypothesis Rejected (median bias)
sexrate 10 25 50 100
0.0000 0.08 (0.0420) 0.03 (0.0350) 0.00 (0.0377) 0.00 (0.0322)
0.0001 0.11 (0.0337) 0.03 (0.0349) 0.03 (0.0402) 0.01 (0.0393)
0.0005 0.02 (0.0362) 0.05 (0.0180) 0.33 (0.0127) 0.50 (0.0078)
0.0010 0.06 (0.0087) 0.28 (0.0072) 0.27 (0.0063) 0.29 (0.0049)
0.0050 0.04 (0.0014) 0.08 (0.0005) 0.23 (0.0005) 0.38 (0.0006)
0.0100 0.03 (0.0011) 0.04 (0.0002) 0.11 (0.0002) 0.22 (0.0002)
0.0500 0.01 (0.0009) 0.01 (0.0000) 0.01 (0.0000) 0.02 (0.0000)
0.1000 0.00 (0.0008) 0.01 (0.0000) 0.01 (0.0000) 0.01 (0.0000)
0.5000 0.00 (0.0003) 0.00 (0.0000) 0.00 (0.0000) 0.01 (0.0000)
1.0000 NA 0.00 (0.0000) NA NA
bvals %>%
  ggplot(aes(x = p.rDcc, group = mutation_rate, fill = mutation_rate)) +
  geom_density(alpha = 0.75) +
  geom_rug(alpha = 0.5) +
  facet_grid(sexrate~sample, scales = "free_y") +
  scale_fill_viridis(option = "A", discrete = TRUE) +
  ggtitle("Clone-censored p-values without bias correction") +
  xlab("P-value")

bvals %>%
  ggplot(aes(x = jack.p.rDb, group = mutation_rate, fill = mutation_rate)) +
  geom_density(alpha = 0.75) +
  geom_rug(alpha = 0.5) +
  facet_grid(sexrate~sample, scales = "free_y") +
  scale_fill_viridis(option = "D", discrete = TRUE) +
  ggtitle("P-values after bias correction of clone-censored data") +
  xlab("P-value (bias-corrected)")

bvals %>%
  unite(the_group, seed, rep, run, gen) %>%
  ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
  geom_line(aes(group = the_group), alpha = 0.2) +
  geom_boxplot(aes(group = factor(sexrate))) +
  # geom_point(aes(group = the_group), alpha = 0.2) +
  facet_grid(sample ~ mutation_rate)

Well, it turns out that things got worse, folks. Perhaps the answer is that we need to consider the fact that there is over-representation of the clones. If we parameterize the resampling procedure by weighting the samples by the value of \(p_{sex}\), then we can account for this over-representation. Consider that it is possible for a clonal genotype to be identical in state without being identical by descent. This is roughly what \(p_{sex}\) measures. When we weight the resamplings with \(p_{sex}\), it’s more likely for us to grab a smaller population that’s representative of what the underlying genotypes are. We can use that to calculate the bias due to smaller sample size and use that to correct the clone-corrected estimate.

Assessing jackknife analysis weighted by \(p_{sex}\)

I went back and created new analyses, weighted by \(p_{sex}\). To avoid wasting memory, I’m going to remove the unnecessary data we have thus far:

rm(vals, jackdat)
gc()
##              used   (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    5882490  314.2    9968622   532.4    9968622   532.4
## Vcells 1285107985 9804.6 2199319715 16779.5 1437250443 10965.4

Now, I can load in the new results:

jack_files <- list.files(here::here("data", "ma_jack_psex_rda_files/"), full.names = TRUE) # load jackknife simulations
jacklist   <- setNames(jack_files, basename(jack_files))

for (i in jack_files){
  jacklist[basename(i)] <- load(i)
}
jackdat <- jacklist %>% 
  map_df(get, .id = "source") %>% 
  mutate(mutation_rate = ifelse(grepl("mutant", source), "even", "uneven")) %>%
  mutate(source = gsub("(^.+?feather).*$", "\\1.DATA.rda", source)) %>%
  select(matches("jack"), everything())
rm(list = jacklist)
gc()
##              used    (Mb) gc trigger    (Mb)   max used    (Mb)
## Ncells    6043189   322.8    9968622   532.4    9968622   532.4
## Vcells 1396209184 10652.3 2199319715 16779.5 1437250443 10965.4
# taking stock of how much data we have
jackdat %>% filter(!is.na(pop))
## # A tibble: 55,300 x 7
##    jack.p.Ia jack.p.rD     jack.ia     jack.rd
##        <dbl>     <dbl>      <list>      <list>
##  1         1         1 <dbl [999]> <dbl [999]>
##  2         1         1 <dbl [999]> <dbl [999]>
##  3         1         1 <dbl [999]> <dbl [999]>
##  4         1         1 <dbl [999]> <dbl [999]>
##  5         1         1 <dbl [999]> <dbl [999]>
##  6         1         1 <dbl [999]> <dbl [999]>
##  7         1         1 <dbl [999]> <dbl [999]>
##  8         1         1 <dbl [999]> <dbl [999]>
##  9         1         1 <dbl [999]> <dbl [999]>
## 10         1         1 <dbl [999]> <dbl [999]>
## # ... with 55,290 more rows, and 3 more variables: source <chr>,
## #   pop <chr>, mutation_rate <chr>
# combining original values and jackknife simulations
vals <- get(res) %>%
  inner_join(filter(jackdat, !is.na(pop)), # filtering out missing data
             by = c("source", "mutation_rate", "pop")) %>%
  select(rbarD, jack.rd, jack.p.rD, mutation_rate, everything())
vals
## # A tibble: 55,300 x 58
##        rbarD     jack.rd jack.p.rD mutation_rate        Ia  p.Ia  p.rD
##        <dbl>      <list>     <dbl>         <chr>     <dbl> <dbl> <dbl>
##  1 0.7877455 <dbl [999]>         1        uneven 15.461693 0.001 0.001
##  2 0.7073192 <dbl [999]>         1        uneven 14.052082 0.001 0.001
##  3 0.7077485 <dbl [999]>         1        uneven 14.068551 0.001 0.001
##  4 0.6849662 <dbl [999]>         1        uneven 13.625753 0.001 0.001
##  5 0.9512911 <dbl [999]>         1        uneven 17.691915 0.001 0.001
##  6 0.9279263 <dbl [999]>         1        uneven 17.432074 0.001 0.001
##  7 0.9369048 <dbl [999]>         1        uneven 17.419960 0.001 0.001
##  8 0.9029985 <dbl [999]>         1        uneven 17.007769 0.001 0.001
##  9 0.3453806 <dbl [999]>         1        uneven  2.385947 0.001 0.001
## 10 0.3593397 <dbl [999]>         1        uneven  2.786404 0.001 0.001
## # ... with 55,290 more rows, and 51 more variables: samples.ia <list>,
## #   samples.rd <list>, Iacc <dbl>, p.Iacc <dbl>, rbarDcc <dbl>,
## #   p.rDcc <dbl>, samples.iacc <list>, samples.rdcc <list>, pop <chr>,
## #   run <chr>, seed <chr>, sexrate <chr>, gen <chr>, rep <chr>,
## #   sample <fctr>, H <dbl>, G <dbl>, lambda <dbl>, E.5 <dbl>, B <dbl>,
## #   uSimp <dbl>, NMLG <dbl>, H.var <dbl>, G.var <dbl>, lambda.var <dbl>,
## #   E.5.var <dbl>, B.var <dbl>, uSimp.var <dbl>, NMLG.var <dbl>,
## #   allele <dbl>, `1-D` <dbl>, Hexp <dbl>, Evenness <dbl>, tab <list>,
## #   allelecc <dbl>, `1-Dcc` <dbl>, Hexpcc <dbl>, Evennesscc <dbl>,
## #   tabcc <list>, H.locus <dbl>, G.locus <dbl>, lambda.locus <dbl>,
## #   E.5.locus <dbl>, uSimp.locus <dbl>, rrmat <list>, curve <list>,
## #   pgen <list>, CF <dbl>, source <chr>, jack.p.Ia <dbl>, jack.ia <list>
vals %>%
  group_by(mutation_rate, sexrate, sample) %>%
  summarize(n = n()) %>%
  spread(sample, n) %>%
  knitr::kable()
mutation_rate sexrate 10 25 50 100
even 0.0000 962 997 1000 1000
even 0.0001 975 995 1000 1000
even 0.0005 985 999 999 999
even 0.0010 954 1000 1000 1000
even 0.0050 579 996 1000 1000
even 0.0100 371 939 1000 1000
even 0.0500 79 473 925 1000
even 0.1000 50 262 737 995
even 0.5000 5 37 140 472
even 1.0000 NA 1 NA NA
uneven 0.0000 927 1000 1000 1000
uneven 0.0001 910 999 1000 1000
uneven 0.0005 878 999 999 999
uneven 0.0010 787 1000 1000 1000
uneven 0.0050 501 985 1000 1000
uneven 0.0100 328 916 1000 1000
uneven 0.0500 76 462 922 1000
uneven 0.1000 50 258 732 995
uneven 0.5000 5 37 140 469
vals %>%
  ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
  geom_histogram(alpha = 0.5, binwidth = 0.05) +
  facet_grid(sexrate~sample, scales = "free") +
  ggtitle("P-value from jack-knife analysis weighted by psex")

Okay this result is almost the complete opposite as compared to the one above. Let’s take a look at the values when the pvalue for \(\bar{r}_d\) is < 0.05:

vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sprintf("%.2f (%d)", sum(jack.p.rD < 0.05)/n(), n())) %>%
  spread(sample, reject)
## # A tibble: 10 x 5
## # Groups:   sexrate [10]
##    sexrate       `10`        `25`        `50`       `100`
##  *   <chr>      <chr>       <chr>       <chr>       <chr>
##  1  0.0000 0.00 (333)  0.00 (260)  0.00 (248)  0.00 (230)
##  2  0.0001 0.00 (272)  0.00 (229)  0.00 (192)  0.00 (183)
##  3  0.0005 0.00 (133)   0.00 (20)    0.00 (6)    0.00 (2)
##  4  0.0010 0.00 (269)   0.00 (43)   0.00 (11)    0.00 (7)
##  5  0.0050 0.00 (721)  0.00 (810)  0.00 (475)  0.00 (250)
##  6  0.0100 0.00 (605) 0.00 (1282) 0.00 (1096)  0.00 (768)
##  7  0.0500 0.00 (143)  0.00 (852) 0.00 (1648) 0.00 (1750)
##  8  0.1000  0.00 (93)  0.00 (483) 0.00 (1309) 0.00 (1752)
##  9  0.5000  0.00 (10)   0.00 (69)  0.00 (248)  0.00 (750)
## 10  1.0000       <NA>    0.00 (1)        <NA>        <NA>
vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  ggplot(aes(x = jack.p.rD, group = mutation_rate, fill = mutation_rate)) +
  geom_histogram(alpha = 0.5, binwidth = 0.05) +
  facet_grid(sexrate~sample, scales = "free_y")

Bias correction

makep <- function(x, y) (sum(x <= y, na.rm = TRUE) + 1)/(length(y) + 1)
bvals <- vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
  mutate(bias = jack.mean - rbarD) %>% # calculating bias a la Effron
  select(bias, jack.mean, everything()) %>%
  mutate(jack.p.rDb = map2_dbl(rbarDcc - bias, samples.rdcc, makep)) %>%
  select(bias, jack.p.rD, jack.p.rDb, everything())

Now we can visualize this as a table of the percentage rejection.

bvals %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sprintf("%.2f (%.4f)", 
                             sum(jack.p.rDb < 0.01)/n(), 
                             mean(bias[is.finite(bias)], na.rm = TRUE))) %>%
  spread(sample, reject) %>%
  knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (median bias)")
Null Hypothesis Rejected (median bias)
sexrate 10 25 50 100
0.0000 0.50 (-0.2114) 0.62 (-0.1246) 0.69 (-0.0990) 0.70 (-0.0874)
0.0001 0.57 (-0.2494) 0.64 (-0.1403) 0.71 (-0.1060) 0.75 (-0.0891)
0.0005 0.92 (-0.2222) 0.95 (-0.1570) 1.00 (-0.1534) 1.00 (-0.1170)
0.0010 0.87 (-0.1275) 0.98 (-0.0993) 1.00 (-0.0903) 1.00 (-0.0774)
0.0050 0.46 (-0.0540) 0.64 (-0.0270) 0.89 (-0.0253) 0.97 (-0.0252)
0.0100 0.31 (-0.0437) 0.32 (-0.0153) 0.59 (-0.0136) 0.84 (-0.0139)
0.0500 0.15 (-0.0366) 0.02 (-0.0064) 0.03 (-0.0032) 0.05 (-0.0028)
0.1000 0.14 (-0.0338) 0.01 (-0.0055) 0.01 (-0.0020) 0.01 (-0.0015)
0.5000 0.30 (-0.0409) 0.00 (-0.0053) 0.00 (-0.0013) 0.00 (-0.0005)
1.0000 NA 0.00 (-0.0073) NA NA

Now this is showing promise. To recap, this table is derived from the situations in which \(\bar{r}_d\) was significant for the full data set, but not for the clone-corrected data set. The values represented are the fraction of clone-corrected populations that rejected the null hypothesis after bias-correction.

A couple of notes about this table. I had to specify finite values for the bias because for sexrates less than 0.0005 and a sample size of 10, we saw some levels of bias less than -1, and a couple that were infinite. I’ll plot them here for your enjoyment.

infbias <- bvals %>% filter(bias < -1) %>% 
  select(pop, jack.rd, rbarDcc, NMLG)
infbias %>%
  unnest() %>%
  ggplot(aes(x = jack.rd)) +
  geom_density(aes(fill = NMLG)) +
  geom_rug() +
  geom_vline(aes(xintercept = rbarDcc)) +
  facet_wrap(~pop, ncol  = 1)

What it’s showing is that when the number of MLGs are reduced to 3, the value of \(\bar{r}_d\) can be really negative!

Now back to the show.

bvals %>%
  ggplot(aes(x = p.rDcc, group = mutation_rate, fill = mutation_rate)) +
  geom_density(alpha = 0.75) +
  geom_rug(alpha = 0.5) +
  facet_grid(sexrate~sample, scales = "free_y") +
  scale_fill_viridis(option = "A", discrete = TRUE) +
  ggtitle("Clone-censored p-values without bias correction") +
  xlab("P-value")

bvals %>%
  ggplot(aes(x = jack.p.rDb, group = mutation_rate, fill = mutation_rate)) +
  geom_density(alpha = 0.75) +
  geom_rug(alpha = 0.5) +
  facet_grid(sexrate~sample, scales = "free_y") +
  scale_fill_viridis(option = "D", discrete = TRUE) +
  ggtitle("P-values after bias correction of clone-censored data") +
  xlab("P-value (bias-corrected)")

bvals %>%
  unite(the_group, seed, rep, run, gen) %>%
  ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
  geom_line(aes(group = the_group), alpha = 0.2) +
  geom_boxplot(aes(group = factor(sexrate))) +
  # geom_point(aes(group = the_group), alpha = 0.2) +
  facet_grid(sample ~ mutation_rate) +
  ggtitle("Progression of bias", subtitle = "full data set compared to sub-sample")

Flaws with “bias correction”

After thinking about this for a little bit, I realized that this method is flawed because the bias is greater if the clone-corrected data is closer to zero. An example can be shown with the “nancycats” data set in adegenet, documenting 17 colonies of cats in Nancy, France. Since we know cats don’t clone, we can trust that this data set comes from sexual organisms.

data("nancycats")
nan7 <- nancycats[pop = 7]
set.seed(999)
nan72 <- nan7[sample(nInd(nan7), nInd(nan7)*2, replace = TRUE)]
indNames(nan72) <- paste("sample", seq(nInd(nan72)))
nan72 <- as.genclone(nan72)
nan7ia    <- ia(nan7, sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
nan72ia   <- ia(nan72, sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
nan72ccia <- ia(clonecorrect(nan72, NA), sample = 999, valuereturn = TRUE, quiet = TRUE, plot = FALSE)
## Warning in clonecorrect(nan72, NA): Strata is not set for nan72. Clone
## correct will be performed without population information.
nanjack   <- jack.ia(nan72, use_psex = TRUE, method = 'multiple', mul = 2, rep = 999, quiet = TRUE)
idx       <- c("Ia", "rbarD")
nan72bi   <- nan72ccia
# Calculating the bias-correction
nan72bi$index[idx] <- nan72ccia$index[idx] - map2_dbl(nanjack, nan72ia$index[idx], function(a, b) (mean(a) - b))
nan72bi$index[c(FALSE, TRUE)] <- map2_dbl(nan72bi$index[idx], nan72bi$samples, makep)
plot(nan7ia) + ggtitle("Original Data")

plot(nan72ia) + ggtitle(paste("Resampled Data with", nmll(nan72), "unique samples"))

plot(nan72ccia) + ggtitle(paste("Resampled Data (clone-corrected)"))

plot(nan72bi) + ggtitle(paste("Resampled Data (bias-corrected)"))

We would have declared this population incorrectly to be clonal, but if we simply used the mean of the jack-knifed samples (accounting for \(p_{sex}\)), 0.042, we would not have been able to reject it.

Using resampling with \(p_{sex}\) as an estimate

Here, we will use the mean of the jack-knifed samples as the estimate and re-calculate the p-value from there. Note that we are using the null distribution generated from the full data set to

jvals <- vals %>%
  filter(p.rD < p.rDcc, p.rDcc > 0.05) %>%
  mutate(jack.mean = map_dbl(jack.rd, mean, na.rm = TRUE)) %>%
  mutate(bias = jack.mean - rbarDcc) %>%
  mutate(jack.p = map2_dbl(jack.mean, samples.rd, makep)) %>%     # against the full null
  mutate(jack.pcc = map2_dbl(jack.mean, samples.rdcc, makep)) %>% # against the clone-corrected null
  select(jack.pcc, everything())

What I show here are the results when compared against the full null distribution, which exhibits less variance than the clone-corrected null. When compared against the clone-corrected null, there were no cases where the null hypothesis was rejected at the 0.01 level.

There are three tables:

  • Using the mean of the resampled distribution at p ≤ 0.01
  • Same as above OR \(E_{5A} \geq 0.85\)
  • The p ≤ 0.01 AND \(E_{5A} \geq 0.85\)
jvals %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sum(jack.p <= 0.01, na.rm = TRUE)/n()) %>%
  spread(sample, reject) %>%
  knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01)")
Null Hypothesis Rejected (p <= 0.01)
sexrate 10 25 50 100
0.0000 0.060 0.092 0.113 0.148
0.0001 0.066 0.096 0.130 0.120
0.0005 0.105 0.400 0.833 1.000
0.0010 0.037 0.209 0.545 0.571
0.0050 0.001 0.000 0.008 0.064
0.0100 0.000 0.000 0.000 0.001
0.0500 0.000 0.000 0.000 0.000
0.1000 0.000 0.000 0.000 0.000
0.5000 0.000 0.000 0.000 0.000
1.0000 NA 0.000 NA NA
jvals %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sum(jack.p <= 0.01 | Evennesscc > 0.85, na.rm = TRUE)/n()) %>%
  spread(sample, reject) %>%
  knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01 or E5A > 0.85)")
Null Hypothesis Rejected (p <= 0.01 or E5A > 0.85)
sexrate 10 25 50 100
0.0000 0.997 0.992 1.000 1.000
0.0001 0.996 0.996 1.000 0.995
0.0005 0.218 0.450 0.833 1.000
0.0010 0.089 0.256 0.545 0.571
0.0050 0.025 0.002 0.011 0.064
0.0100 0.030 0.005 0.002 0.003
0.0500 0.021 0.002 0.002 0.003
0.1000 0.043 0.017 0.002 0.001
0.5000 0.200 0.000 0.008 0.000
1.0000 NA 0.000 NA NA
jvals %>%
  group_by(sexrate, sample) %>%
  summarize(reject = sum(jack.p <= 0.01 & Evennesscc > 0.85, na.rm = TRUE)/n()) %>%
  spread(sample, reject) %>%
  knitr::kable(digits = 3, caption = "Null Hypothesis Rejected (p <= 0.01 and E5A > 0.85)")
Null Hypothesis Rejected (p <= 0.01 and E5A > 0.85)
sexrate 10 25 50 100
0.0000 0.060 0.092 0.109 0.148
0.0001 0.066 0.096 0.130 0.120
0.0005 0.023 0.100 0.167 0.000
0.0010 0.000 0.047 0.000 0.000
0.0050 0.001 0.000 0.000 0.000
0.0100 0.000 0.000 0.000 0.000
0.0500 0.000 0.000 0.000 0.000
0.1000 0.000 0.000 0.000 0.000
0.5000 0.000 0.000 0.000 0.000
1.0000 NA 0.000 NA NA
jvals %>%
  ggplot(aes(x = jack.p, group = mutation_rate, fill = mutation_rate)) +
  geom_histogram(alpha = 0.5, binwidth = 0.05) +
  facet_grid(sexrate~sample, scales = "free") +
  ggtitle("P-value from jack-knife analysis weighted by psex")

I thought that maybe there might be some effect in bias where clonal samples would exhibit a greater bias, but this does not appear to be the case.

jvals %>%
  unite(the_group, seed, rep, run, gen) %>%
  ggplot(aes(x = sexrate, y = bias, fill = mutation_rate)) +
  # geom_line(aes(group = the_group), alpha = 0.2) +
  geom_boxplot(aes(group = factor(sexrate))) +
  # geom_point(aes(group = the_group), alpha = 0.2) +
  facet_grid(sample ~ mutation_rate) +
  scale_y_log10() +
  ggtitle("Progression of bias", subtitle = "full data set compared to sub-sample")
## Warning in self$trans$transform(x): NaNs produced
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 14454 rows containing non-finite values (stat_boxplot).

Conclusions

Clone-correction provides a means of assessing \(\bar{r}_d\) without the bias from duplicated genotypes (which tends to inflate the statistic) at the cost of bias from a decreased sample size. It is intuitive that we can account for the sample size bias by randomly sampling a subset of the data without replacement and recalculating \(\bar{r}_d\). The problem is that, while this removes the bias due to sample size, it’s still affected by the bias introduced from over-represented genotypes.

We can address both of these biases at the same time by weighting the samples by the probability of encountering the nth sample of the same genotype by chance, also known as \(p_{sex}\). This allows identical genotypes from different genets to be included in the analysis and gives us a better idea of what the population without identical-by-descent genotypes looks like.

An example of how this works can be seen in real data from the outbreak of Phytophthora ramorum in Curry County, OR.

library("poppr")
data(Pram)
iard <- c("Ia", "rbarD") # vector to get stats from results
JHC <- Pram[pop = "JHallCr_OR"] # grabbing the Joe Hall Creek watershed
# calculating allele frequencies from entire epidemic
af  <- rraf(Pram, by_pop = FALSE, mul = 2, res = "vector")
jia   <- ia(JHC, sample = 999, quiet = TRUE, valuereturn = TRUE)

jiacc <- JHC %>% clonecorrect(NA) %>% ia(sample = 999, quiet = TRUE, valuereturn = TRUE)

jjak  <- jack.ia(JHC, quiet = TRUE, reps = 999, use_psex = TRUE, method = "multiple", freq = af)
jjak2 <- jack.ia(JHC, quiet = TRUE, reps = 999)
(jbias <- map_dbl(jjak, mean, na.rm = TRUE))
##         Ia      rbarD 
## 0.09525167 0.02444025
(jiap  <- map2_dbl(jbias, jia$samples, function(a, b) (sum(a <= b, na.rm = TRUE) + 1)/(length(b) + 1)))
##    Ia rbarD 
## 0.002 0.002
df <- data_frame(rbarD = c(jia$index[3], jiacc$index[3], jbias[2]),
                 type = c("observed", "clone-censored", "estimate"))
bind_rows(null = jia$samples,
          nullcc = jiacc$samples,
          resampled.psex = jjak,
          resampled = jjak2,
          .id = "type") %>%
  ggplot(aes(x = rbarD)) +
  geom_density(aes(fill = type), alpha = 0.5) +
  geom_rug(aes(color = type), alpha = 0.1) +
  geom_vline(xintercept = df$rbarD, lty = 1:3) +
  annotate(geom = "text", x = df$rbarD, y = 10*c(3, 5, 4), label = df$type, hjust = c(1.05, 1.05, -0.05)) +
  scale_fill_brewer(palette = "Set1", labels = c("Null", "Null (clone-censored)", "Resampled", "Resampled (Psex)")) +
  scale_color_brewer(palette = "Set1", labels = c("Null", "Null (clone-censored)", "Resampled", "Resampled (Psex)")) +
  theme_bw(base_size = 16, base_family = "Helvetica") +
  theme(aspect.ratio = 0.68) +
  theme(legend.position = "top") +
  xlim(c(-0.06, 0.3)) +
  theme(panel.border = element_blank()) +
  theme(panel.grid.major.y = element_line(color = "grey70")) +
  theme(panel.grid.minor.y = element_line(color = "grey50"))
## Warning: Removed 33 rows containing non-finite values (stat_density).

What we can see from this example is that the estimate here would easily be considered clonal when compared to the Full null distribution. However, when compared to the clone-censored null, the estimate is still non-significant due to the wide variance exhibited. Another issue arises when considering the fact that this can only be performed on a small number of markers considering that \(p_{gen}\), and therefore \(p_{sex}\) approaches zero as the number of loci increases. However, we can see that this approach does detect the non-random mating in populations with a sex rate >1%.

Ultimately, there is still no good way of assessing clonality. We must accept the fact that \(\bar{r}_d\) was simply not designed to be this sensitive to different levels of sexual reproduction. It may be more important to think about what the biological consequence of 1% sexual reproduction in a population is.